Open-Set Language Identification

Author

  • Shervin Malmasi
Abstract

We present the first open-set language identification experiments using one-class classification models. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine for each language using only a monolingual corpus. Each model is evaluated against a test set of data from all 10 languages, and we achieve an average F-score of 0.99, demonstrating the effectiveness of this approach for open-set language identification.
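The hashing-based vectorization mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function name `hash_char_ngrams`, the use of character bigrams, the 1024-bucket width, and CRC32 as the hash function. The point it shows is that hashing needs no vocabulary learned from training data, so text in a script never seen during training still maps into the same fixed-width feature space.

```python
import zlib
from collections import Counter


def hash_char_ngrams(text, n=2, n_buckets=1024):
    """Hypothetical helper: sparse, fixed-width counts of hashed character n-grams.

    The n-gram order and bucket count are illustrative assumptions,
    not values taken from the paper.
    """
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[bucket] += 1
    return dict(counts)


# Text in two different writing systems lands in the same feature space,
# with no vocabulary fitted beforehand.
english = hash_char_ngrams("language identification")
persian = hash_char_ngrams("شناسایی زبان")
```

Vectors of this kind could then be passed to a one-class model trained on a single language. In scikit-learn, `HashingVectorizer` and `OneClassSVM` are the corresponding components, though the abstract does not say which toolkit or settings were used.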

Similar resources

Out-of-Set i-Vector Selection for Open-set Language Identification

Current language identification (LID) systems are based on an i-vector classifier followed by a multi-class recognition back-end. Identification accuracy degrades considerably when LID systems face open-set data. In this study, we propose an approach to the problem of out-of-set (OOS) data detection in the context of open-set language identification. In our approach, each unlabeled i-vector in t...

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

Analysis of multitarget detection for speaker and language recognition

The general multitarget detection (open-set identification) task is the intersection of the more familiar tasks of close-set identification and open-set verification/detection. In the multitarget detection task, an input of unknown class is processed by a bank of parallel detectors and a decision is required as to whether the input is from among the target classes and, if so, which one. In this...

Finding and Identifying Text in 900+ Languages

This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, a...

Leveraging the open source ispell codebase for minority language analysis

The ispell family of spellcheckers is perhaps the single most widely ported and deployed open-source language tool. Here we describe how the SzóSzablya ‘WordSword’ project leverages ispell’s Hungarian descendant, HunSpell, to create a whole set of related tools that tackle a wide range of low-level NLP-related tasks such as character set normalization, language detection, spellchecking, stemmin...

Journal title:
  • CoRR

Volume: abs/1707.04817

Publication date: 2017